Statistical Approach to Multipitch Analysis

Author

  • Hirokazu Kameoka
Abstract

This work addresses the problem of estimating the “information” of each sound source separately from an acoustic signal of a compound sound. Here “information” is used in a wide sense to include not only the waveform of each separate source signal but also its power spectrum, fundamental frequency (F0), spectral envelope, and other features. Such a technique is potentially useful for a wide range of applications, such as robot auditory sensors, robust speech recognition, automatic transcription of music, waveform encoding for audio CODEC (compression-decompression) systems, a new equalizer system enabling bass and treble controls for each separate source, and indexing of music for music retrieval systems. Generally speaking, if the compound signal were separated, it would be a simple matter to obtain an F0 estimate from each stream using a single-voice F0 estimation method; conversely, if the F0s were known in advance, they would be very useful information for separation algorithms. Source separation and F0 estimation are therefore essentially a “chicken-and-egg problem”, and it is thus perhaps better to formulate these two tasks as a joint optimization problem. In Chapter 2, we introduce a method called “Harmonic Clustering”, which searches for the optimal spectral masking function and the optimal F0 estimate for each source by performing the source separation step and the F0 estimation step iteratively. In Chapter 3, we establish a generalized principle of Harmonic Clustering by showing that it can be understood as the minimization of the distortion between the power spectrum of the mixed sound and a mixture of spectral cluster models. Based on this fact, it becomes clear that the problem amounts to a maximum likelihood problem with the continuous Poisson distribution as the likelihood function.
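The alternation between a source separation step and an F0 estimation step described for Harmonic Clustering can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the function names `harmonic_clustering` and `synth_power`, the Gaussian shape of each partial on a log-frequency axis, the equal partial weights, the fixed width `sigma`, and the synthetic test spectrum. It is a simplified stand-in, not the author's formulation.

```python
import numpy as np

def harmonic_clustering(freqs, power, f0_init, n_harm=8, sigma=0.02, n_iter=100):
    """Alternate soft spectral masking and F0 re-estimation (illustrative only).

    freqs : positive bin frequencies in Hz
    power : non-negative power spectrum sampled at freqs
    f0_init : initial F0 guesses in Hz, one per source
    """
    x = np.log(freqs)                                  # log-frequency axis
    log_n = np.log(np.arange(1, n_harm + 1))           # log of harmonic numbers
    mu = np.log(np.asarray(f0_init, dtype=float))      # log-F0 of each source
    for _ in range(n_iter):
        # Assumed cluster model: one Gaussian bump per partial at log F0 + log n.
        centers = mu[:, None] + log_n[None, :]         # shape (K, N)
        dist = x[None, None, :] - centers[:, :, None]  # shape (K, N, bins)
        model = np.exp(-0.5 * (dist / sigma) ** 2) + 1e-300  # floor avoids 0/0
        # Separation step: masks share each bin among all partials of all sources.
        mask = model / model.sum(axis=(0, 1), keepdims=True)
        # F0 step: power-weighted mean of (log f - log n) over each source's bins.
        weight = mask * power[None, None, :]
        target = x[None, None, :] - log_n[None, :, None]
        mu = (weight * target).sum(axis=(1, 2)) / weight.sum(axis=(1, 2))
    # Per-source masks come from the final separation step.
    return np.exp(mu), mask.sum(axis=1)

def synth_power(freqs, f0s, n_harm=8, sigma=0.02):
    """Synthetic mixture spectrum (hypothetical test data, not real audio)."""
    x = np.log(freqs)
    power = np.zeros_like(x)
    for f0 in f0s:
        for n in range(1, n_harm + 1):
            power += np.exp(-0.5 * ((x - np.log(n * f0)) / sigma) ** 2)
    return power

freqs = np.geomspace(50.0, 4000.0, 2000)               # log-spaced bins
power = synth_power(freqs, [200.0, 310.0])
f0_est, masks = harmonic_clustering(freqs, power, f0_init=[195.0, 320.0])
print(np.round(f0_est, 1))
```

In this sketch the masks sum to one at every frequency bin, so the observed spectrum is partitioned rather than duplicated across sources; the F0 update is then a closed-form weighted mean, which is what makes the alternation cheap.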
This Bayesian reformulation enables us not only to impose empirical constraints on the parameters, which are usually necessary for any underdetermined problem, by introducing prior probabilities, but also to derive a model selection criterion that leads to an estimate of the number of sources. We confirmed through experiments the effectiveness of the two techniques introduced in this chapter: multiple F0 estimation and source number estimation. Human listeners are able to concentrate on a target sound without difficulty even when many speakers are talking at the same time or many instruments are played together. Recent efforts have been directed toward implementing this human ability, called “auditory stream segregation”. Such an approach is referred to as “Computational Auditory Scene Analysis (CASA)”. In Chapter 4, we aim to develop a computational algorithm that decomposes the time-frequency components of the signal of interest into distinct clusters such that each of them is associated with a single auditory stream. To do so, we directly construct a spectro-temporal model whose shape can vary freely within the constraints known as “Bregman’s grouping cues”, and then fit a mixture of this model to the observed spectrogram as closely as possible. We call this approach “Harmonic-Temporal Clustering”. While most conventional methods extract the instantaneous features at each discrete time point and estimate the whole tracks of these features in separate stages, the method described in this chapter performs both procedures simultaneously. We confirmed the advantage of the proposed method over conventional methods through experimental evaluations.
Although intensive efforts have been devoted to both F0 estimation and spectral envelope estimation in the speech processing area, the two problems seem to have been tackled independently. If the F0 were known in advance, the spectral envelope could be estimated very reliably. On the other hand, if the spectral envelope were known in advance, we could easily correct subharmonic errors. F0 estimation and spectral envelope estimation, having such a chicken-and-egg relationship, should thus be done jointly rather than independently in successive procedures. From this standpoint, we propose a new speech analyzer that jointly estimates pitch and spectral envelope using a parametric speech source-filter model. Our experiments showed a significant advantage of joint estimation for both F0 estimation and spectral envelope estimation. The approaches of the preceding chapters are based on the approximate assumption of additivity of the power spectra (neglecting the terms corresponding to interference between frequency components), but it usually becomes difficult to infer F0s when two voices with close F0s are mixed, as long as we look only at the power spectrum. In this case, not only the harmonic structure but also the phase difference of each signal becomes an important cue for separation. Moreover, with future source separation methods for multi-channel signals from multiple sensor inputs in mind, analysis methods in the complex spectrum domain that include phase estimation are indispensable. Given the effectiveness and advantages of the approach described in the preceding chapters, we have been motivated to extend it to a complex-spectrum-domain approach without losing its essential characteristics.
The main topic of Chapter 6 is the development of a nonlinear optimization algorithm to obtain the maximum likelihood parameters of the superimposed periodic signal model. The difficulty of single-tone frequency estimation and of fundamental frequency estimation, which are at the core of the parameter estimation problem for the sinusoidal signal model, comes essentially from the nonlinearity of the model in the frequency parameter. Focusing on this fact, we introduce a new iterative estimation algorithm using a principle called the “auxiliary function method”. This idea was inspired by the principle of the EM algorithm. Through simulations, we confirmed the advantage of the proposed method over the existing gradient descent-based method in both the ability to avoid local solutions and the convergence speed. We also confirmed the basic performance of our method through single-channel (1ch) speech separation experiments on real speech signals.
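The general idea behind an auxiliary function method can be shown on a toy problem: build an upper bound that touches the objective at the current estimate, minimize the bound in closed form, and repeat, so the objective never increases. The sketch below is not the sinusoidal-model algorithm of Chapter 6; the objective (a sum of absolute deviations), the quadratic bound |r| <= r^2 / (2|r0|) + |r0| / 2, and the name `mm_minimize_abs` are all assumptions chosen only to keep the illustration short.

```python
import numpy as np

def mm_minimize_abs(a, x0, n_iter=200, eps=1e-12):
    """Minimize f(x) = sum_i |x - a_i| with a quadratic auxiliary function.

    At the current estimate x_t, each |x - a_i| is bounded from above by
    (x - a_i)^2 / (2 |x_t - a_i|) + |x_t - a_i| / 2, with equality at x = x_t.
    Minimizing the summed bound gives a weighted mean with weights
    1 / |x_t - a_i|, so every iteration has a closed-form update.
    """
    a = np.asarray(a, dtype=float)
    x = float(x0)
    history = [np.abs(x - a).sum()]            # objective value per iteration
    for _ in range(n_iter):
        # eps guards against division by zero when x hits a data point.
        w = 1.0 / np.maximum(np.abs(x - a), eps)
        x = float((w * a).sum() / w.sum())     # minimizer of the auxiliary bound
        history.append(np.abs(x - a).sum())
    return x, history

a = [1.0, 2.0, 3.0, 10.0, 50.0]
x_hat, history = mm_minimize_abs(a, x0=np.mean(a))
print(x_hat, history[0], history[-1])
```

The design point this toy shares with the EM algorithm is that no step size has to be tuned: each update is the exact minimizer of a surrogate, which is what distinguishes this family of methods from gradient descent.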

Similar resources

A Hierarchical Bayesian Model of Chords, Pitches, and Spectrograms for Multipitch Analysis

This paper presents a statistical multipitch analyzer that can simultaneously estimate pitches and chords (typical pitch combinations) from music audio signals in an unsupervised manner. A popular approach to multipitch analysis is to perform nonnegative matrix factorization (NMF) for estimating the temporal activations of semitone-level pitches and then execute thresholding for making a pianor...

A Classification Approach to Multipitch Analysis

This paper proposes a pattern classification approach to detecting the pitches of multiple simultaneous sounds. In order to deal with the octave ambiguity in pitch estimation, a statistical classifier is trained which observes the value of a detection function both at the position of a candidate pitch period and at its integer multiples and submultiples, in order to decide whether the candidate...

Monaural Voiced Speech Separation with Multipitch Tracking

Separating voiced speech from its mixtures with interferences in monaural condition is not only an important but also challenging task. As multipitch tracking can enable much better performance of speech separation for CASA systems, we propose a new multipitch determination algorithm, which can be used under various kinds of noise conditions. In the process of multipitch estimation, a new repre...

Bayesian Nonnegative Harmonic-Temporal Factorization and Its Application to Multipitch Analysis

Since important musical features are mutually dependent, their relations should be analyzed simultaneously. Their Bayesian analysis is particularly important to reveal their statistical relation. As the first step for a unified music content analyzer, we focus on the harmonic and temporal structures of the wavelet spectrogram obtained from harmonic sounds. In this paper, we present a new Bayesi...

A Multipitch Approach to Tonic Identification in Indian Classical Music

The tonic is a fundamental concept in Indian classical music since it constitutes the base pitch from which a lead performer constructs the melodies, and accompanying instruments use it for tuning. This makes tonic identification an essential first step for most automatic analyses of Indian classical music, such as intonation and melodic analysis, and raga recognition. In this paper we address ...

Multipitch tracking using a factorial hidden Markov model

In this paper, we present an approach to track the pitch of two simultaneous speakers. Using a well-known feature extraction method based on the correlogram, we track the resulting data using a factorial hidden Markov model (FHMM). In contrast to the recently developed multipitch determination algorithm [1], which is based on a HMM, we can accurately associate estimated pitch points with their ...

Publication date: 2007